GNNMF: a multi-view graph neural network for ATAC-seq motif finding

Background The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) utilizes the Transposase Tn5 to probe open chromatic, which simultaneously reveals multiple transcription factor binding sites (TFBSs) compared to traditional technologies. Deep learning (DL) technology, including convolutional neural networks (CNNs), has successfully found motifs from ATAC-seq data. Due to the limitation of the width of convolutional kernels, the existing models only find motifs with fixed lengths. A Graph neural network (GNN) can work on non-Euclidean data, which has the potential to find ATAC-seq motifs with different lengths. However, the existing GNN models ignored the relationships among ATAC-seq sequences, and their parameter settings should be improved. Results In this study, we proposed a novel GNN model named GNNMF to find ATAC-seq motifs via GNN and background coexisting probability. Our experiment has been conducted on 200 human datasets and 80 mouse datasets, demonstrated that GNNMF has improved the area of eight metrics radar scores of 4.92% and 6.81% respectively, and found more motifs than did the existing models. Conclusions In this study, we developed a novel model named GNNMF for finding multiple ATAC-seq motifs. GNNMF built a multi-view heterogeneous graph by using ATAC-seq sequences, and utilized background coexisting probability and the iterloss to find different lengths of ATAC-seq motifs and optimize the parameter sets. Compared to existing models, GNNMF achieved the best performance on TFBS prediction and ATAC-seq motif finding, which demonstrates that our improvement is available for ATAC-seq motif finding. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10218-0.


Background
Transcription factors (TFs) are proteins that can bind to DNA sequences and play a crucial role in gene-regulated networks [1], cell cycle regulation [2], and human diseases [3].The region where TFs bind to DNA sequences is called transcription factor binding sites (TFBSs), which are short and conserved DNA fragments [4].Transcription regulation is carried out by the interplay between TFs and TFBSs in DNA sequences, and identifying TFBSs aids us to reveal functions of TFs and the cause of human diseases [5].The aligned TFBSs of the same TF can be defined as a regulatory DNA motif, which can be represented by a position weight matrix (PWM) [6].Assays for Transposase-Accessible Chromatin using sequencing (ATAC-seq) can probe open chromatin via Tn5, which is an effective method for locating TFBSs at a genome-wide level [7,8].Compared to Chromatin immunoprecipitation sequencing (ChIP-seq) and DNase I hypersensitive sites sequencing (DNase-seq) technologies, ATAC-seq can reveal more kinds of transcription factor binding regions [9].An ATAC-seq footprint is a fragment of a DNA sequence that has not been cleaved due to the transcription factors binding.TOBIAS and Hint-ATAC are feasible tools for identifying ATAC-seq footprints, TOBIAS utilizes bias correction [10] and footprinting scores to locate footprints, and Hint-ATAC employs the hidden Markov model to locate footprints [11].ATAC-seq footprints have been applied to TF network prediction [11], comparison of TF activity [12], identification of TFs enriched in peripheral blood mononuclear cells (PBMC)-specific peaks [13], and ATAC-seq motifs finding [14].
There are several special tools for ATAC-seq motif scanning, such as TRACE, chromVAR, SnapATAC [15][16][17].They scan input sequences by using known motifs, which limits their ability to find ATAC-seq motifs.DL technology, including convolutional neural networks (CNNs) and the recurrent neural networks has achieved successes in protein-protein networks, gene-regulated networks, and motifs finding [18][19][20].ATAC-seq motif finding includes two key steps: the first step is to predict an ATAC-seq sequence that contains TFBSs, i.e.TFBS prediction, and the second step is to find ATAC-seq motifs.Existing models such as scFAN, FactorNet, and DeepATAC employ CNNs to predict TFBSs and find motifs from ATAC-seq data [21,22].These models utilized convolutional kernels of CNNs to scan ATAC-seq sequences, and a fully connected layer to predict TFBSs.Due to the limitation of the width of convolutional kernels, these models only find motifs with fixed lengths.Moreover, these existing models do not consider the coexisting probability of TFBSs in an input sequence.In a related line of research, the graph neural network (GNN) can learn the key message from the graph-structured data [23].GNNs can learn the embeddings of nodes via their neighboring nodes, and keep the connection of the graph unchanged and nodes' embedding can also enable graph-based explanation and reasoning.GNNs work on non-Euclidean structured data and have been shown to perform well on molecular structures [24], protein-protein networks [25], gene-gene networks [26], ATAC-seq motif finding [14], etc. MMGraph is an ATAC-seq motif predictor based on GNN and coexisting probability to find multiple motifs [14].MMGraph utilizes ATAC-seq footprints to build a multi-view heterogeneous graph by defining coexisting, Hamming, jaccard, and inclusive edges.Compared with CNN-based models, MMGraph utilizes ATAC-seq footprints to build the multi-view heterogeneous graph and employs the coexisting probability to find multiple ATAC-seq motifs of different lengths.However, MMGraph still has certain defects.MMGraph sets the coexistence probability threshold to 0.5, which needs to be optimized.MMGraph uses only Hamming distance and coexisting probability to measure the relationships between k-mers, which ignored the correlation between k-mers.
To address the above issues, this study developed a novel model called GNNMF to find ATAC-seq motifs and predict TFBSs.GNNMF is based on MMGraph and has three improved aspects: built a multi-view heterogonous graph, which contains four kinds of edges (coexisting edges, similarity edges, jaccard edges, and inclusive edges) and two types of nodes (k-mer nodes and sequence nodes); defined the iterloss function to improve TFBS prediction accuracy; optimized the threshold of coexisting probability via defining coexisting probability between k-mer nodes in negative sequences.We tested the effectiveness of the GNNMF model on 200 human ATAC-seq datasets and 80 mouse ATAC-seq datasets across nine evaluation metrics, including the area of eight metrics radar (AEMR), precision, recall, F1_score, accuracy (ACC), specificity, the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC), and the area under the precision-recall curve (PRC).According to the experimental results, GNNMF improved the ACC, MCC, and AUC by 9.6%, 18.2%, and 2.9%, respectively, on 200 human ATACseq datasets and by 2.36%, 6.35%, and 3.01%, respectively, on 80 mouse ATAC-seq datasets.In this study, AEMR is an overall score for evaluating the model's ability to predict TFBSs.GNNMF improved the AEMR by 4.92% and 6.81% on 200 human and 80 mouse ATAC-seq datasets, respectively.Moreover, among all the models, GNNMF find the most 385 and 662 significant motifs from human ATAC-seq data and mouse ATAC-seq data , respectively.

Data processing
The 280 ATAC-seq datasets, including 200 human ATAC-seq datasets and 80 mouse ATAC-seq datasets, were randomly downloaded from the ENCODE project (Supplementary Table S1).To obtain corrected footprints, footprints of each dataset are obtained via TOBIAS and Hint-ATAC tools with default parameters [11,12].All footprints are ranked in descending order by their scores.Then the top 1500 footprints are used to intersect with the footprints that Hint-ATAC obtains, and the intersected footprints of each dataset are applied to the following analysis.
Each intersected footprint is pruned with 101 bps around its center by the bedtools [27], which is set as a positive sequence.The TFBS prediction is a binary classification, so we shuffle all bases within a positive sequence as a negative sequence [28].This paper gives a positive sequence a label of '1' , and gives a negative sequence a label of '0' .Thus, we can obtain an ATAC-seq sequence set S that includes n sequences.
For each dataset, 80% of S is set as training data, 10% of S is set as validation data, and the remaining S is set as testing data.Then, the sequence s(•) ( s(•) ∈ S ) is trimmed into k-mers k(•) by the step of one base to obtain a k-mer set Ks(•)(lenk = length(k(•)) ).s(•) and k-mer k(•) will be utilized to build the multi-view heterogeneous graph G , where s(•) and k(•) are nodes.

Building the multi-view heterogeneous graph
The multi-view heterogeneous graph G contains two kinds of nodes (sequence node s(•) and k-mer node k(•) ), and four types of edges (coexisting edge, similarity edge, Jaccard edge and inclusive edge) (Fig. 1A).Coexisting edges (view 1) represent coexisting relationships between k-mer nodes in a sequence, and the weights of coexisting edges are calculated by the coexisting probability [14].Similarity edges (view 2) represent relations between k-mer nodes and the weights of similarity edges are measured by Hamming distances [29].Jaccard edges (view 3) represent relationships between k-mer nodes, and the weights of Jaccard edges are measured by the Jaccard correlation coefficient.Inclusive edges represent inclusive relations between a given sequence s(•) and its constituent k-mers, and weights of inclusive edges are measured by TF-IDF (term frequency-inverse document frequency) [30].
We let Ks(i) be a k-mer set that contains all unique k(•) within s(i) .K is an m k-mer set that contains all unique k(•) in all Ks(i) of S , where m is the total number of unique k(•) within S .Four edge types are defined to build G , where s(•) and k(•) are nodes.
Coexisting edges measure the coexisting probability between k(•) nodes.The coexisting edge weight between two k-mers k(p) and k(j) is calculated by Formula ( 1): (1) Fig. 1 The GNNMF framework where Wco represents a m × m coexisting weight matrix, num(k(•)) represents the total number of Ks(•) that con- tain k(•) , and nums(k(p), k(j)) is the total number of Ks(•) that contain both k(p) and k(j).Similarity edges measure mismatches between k(•) nodes using the Hamming distance [31].The similarity edge weight between k(p) and k(j) is calculated by Formula ( 2 Jaccard edges measure the correlation between k(•) nodes using the Jaccard correlation coefficient.The Jaccard edge weight between two nodes k(p) and k(j) is calculated by Formula (3): where Sk(•) presents a sequence set that contains all sequences, including k(•) .Wjac represents a m × m simi- larity weight matrix, p , j ∈ [1, ..., m].

GNNMF overview
In this study, we developed a three-layer GNN model named GNNMF to find ATAC-seq motifs.GNNMF decomposes the multi-view heterogeneous graph G into four sub- graphs, i.e. a coexisting subgraph, similarity subgraph, Jaccard subgraph, and inclusive subgraph (Fig. 1B).The first layer of GNNMF is utilized to learn the embedding of k(•) as Ek(k(•)) ∈ R dc+ds+dj where dc , ds and dj are the embed- the coexisting subgraph, similarity subgraph and Jaccard subgraph, respectively.The second layer learns the embedding of s(•) as Es(s(•)) ( Es(s(•)) ∈ R dsq ) from the inclusive subgraph, where dsq is the dimension of Es(s(•)) .The third layer is a fully connected layer, which is used to predict TFBSs.
where W co (W co ∈ R m×dc ) is a training weight matrix, which is randomly initialized.ReLU (•) represents the rec- tified linear unit, which is an activation function.
where Esim(k(p)) is the embedding of k(p) in the simi- larity subgraph.W sim (W sim ∈ R m×ds ) is a training weight matrix, which is randomly initialized.
GNNMF learns the embedding of k(p) as Ejac(k(p)) from the Jaccard subgraph, and Ejac(k(p)) is calculated by Formula (11).
where W jac (W jac ∈ R m×dj ) is a training weight matrix that is randomly initialized.
where W (W ∈ R n×dsq ) is the training weight matrix, which is randomly initialized.b represents the bias, which is a n × 1 vector, sigmoid(•) is a sigmoid func- tion, and ŷ(s(i)) represents the predicted label of s(i) by GNNMF.
The BCEloss is the binary cross-entropy loss, which is used to calculate the binary cross-entropy loss between the predicted probability and the true label.Based on BCEloss, we define the iterloss as the loss function of GNNMF [33]: where y(s(i)) represents the true label of sequence s(i) (Formula ( 19)); ŷ(s(i)) represents the predicted label of sequence s(i) ; and epoch represents the iteration, BCEloss(s(i)) 0 = 0.

Evaluation metrics
Our evaluation metrics include precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR [34].AEMR is the area of a radar chart, which can be generated that consists of eight equiangular spokes with each spoke representing one of the scores defined above [14].The higher the AEMR score is, the better the performance of model for TFBS prediction.
All the evaluation metrics are calculated via the following formulas: (20) where TP, TN, FP and FN represent the number of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN), respectively.The AUC is the area under the receiver operating characteristic curve, and the PRC is the area under the precision-recall curve, which is a value between 0 and 1.

Experiment settings
This study trained GNNMF for 30 epochs using the Adam optimizer [35] (Fig. 2).The hyperparameters of GNNMF included the learning rate, dsim , dco , djac , dsq and lenk , and their ranges were tested in this study (Supplementary Table S2).We utilized the grid search to choose the optimal parameters and applied the combinations of those hyperparameters to GNNMF model.The AUC is used to measure the performance of GNNMF with different combinations on the testing set.The learning rate of GNNMF was set as 0.001 and was adjusted by the natural exponential decay with 0.001.When we trained GNNMF on the training set, the sequence nodes in validation set and the testing set were masked.The dsim , dco and djac were set as 50, and dsq was set as 150 (26) R = [precision, recall, F 1_score, ACC, specificity, MCC, AUC, PRC, precision] AEMR = sum(O i,i+1 ) i = 1, ..., 8 in our experiment.Considering the computational complexity and the performance of GNNMF, we set lenk = 5 .The scFAN, DeepATAC, FactorNet, MMGraph, and MMGraph+jac models are used as comparison tools, where MMGraph+jac combines MMGraph and jaccard graph.The evaluation metrics, including precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR, are utilized to evaluate models' performance on TFBS prediction.To explain the GNNMF model, ATAC-seq motifs are used to show features that GNNMF learned.We matched the found motifs to motifs that the HOCO-MOCO database contains via the TOMTOM tool.

TFBS prediction
In this study, we utilized nine metrics to evaluate the performance of all the models on 200 human ATACseq datasets and 80 mouse ATAC-seq datasets (Fig. 3).FactorNeT, scFAN, DeepATAC, MMGraph, and MMGraph+jac were selected as comparison tools, because they achieved feasible performances in the previous study.In the light of results, AEMRs of models based on the GNN model are higher than those of the models based on the CNN model, which demonstrates that the Fig. 2 The workflow of the experimental setting GNN model is more suitable for ATAC-seq TFBS prediction.On 200 human ATAC-seq datasets, GNNMF achieved the highest average AEMR score (2.13).Compared to existing models, GNNMF improved the AEMR by 4.92%.Among models based on CNNs, DeepATAC obtained an AEMR of 1.59 , which is higher than those of scFAN and FactorNeT.On 80 mouse ATAC-seq datasets, GNNMF achieved consistent results on 200 human ATAC-seq datasets, and improved the AEMR by 6.81% (Supplementary Figure S1).On 280 ATAC-seq datasets, GNNMF yielded the highest AEMR score, which indicated that our improvement was efficient.

Finding multiple ATAC-seq motifs
ATAC-seq can reveal all opening chromatin, which means that there are multiple motifs in an ATAC-seq dataset.In this study, we developed the GNNMF model to find ATAC-seq motifs by employing the GNN and coexisting probability.Existing DL models to find ATAC-seq motifs, including FactorNet, scFAN, DeepATAC, and MMGraph, were selecteds as comparison tools.We tested the above models on 280 ATAC-seq datasets including 200 human ATAC-seq datasets and 80 mouse ATAC-seq datasets.This study matches the found motifs to the HOCOMOCO database via the TOMTOM v5.1.0tool.The p-value less than 0.05 indicated that the found motifs are significant.All motifs that each model found are listed in Table 1, 401 human ATAC-seq motifs were found.Among all the models, GNNMF found the most ATAC-seq motifs.Moreover, we tested all the models on 80 mouse ATAC-seq datasets, and the number of motifs in which each model found is listed in Supplementary Table S4.The p-value depicts the degree to which the found motifs are significant.The p-values of the found motifs of each model are shown in the violin plot, and the median value shows the model's ability to find ATAC-seq motifs (Fig. 4).GNNMF achieved the highest median value among all the models.Based on motif finding results, GNNMF found the most motifs among all the models.The GNN-based models outperformed the CNN-based models, which indicated that the GNN is a feasible algorithm for finding ATAC-seq motifs.GNNMF is based on MMGraph, which is improved via the Jaccard edge, background probability, and the iterloss function.Our results demonstrated that our improvements were conducive to finding ATAC-seq motifs.

Discussion
This study developed a GNN model called GNNMF by improving the MMGraph model.We improved the MMGraph by using the Jaccard edge, iterloss function, and background probability.GNNMF decomposed the multi-view heterogeneous graph into four subgraphs, and employed a three-layer GNN model to predict TFBSs.The first layer of GNNMF was used to learn the embedding of k-mer nodes via the similarity subgraph, Jaccard subgraph, and coexisting subgraph.The second layer was used for the embedding of sequence nodes via the inclusive subgraph, and the third layer was used to predict TFBSs.GNNMF utilizes embeddings of k-mer nodes and sequence nodes to define k-mer seeds and employs the coexisting probability to find multiple motifs with different lengths.Existing models to find ATAC-seq motifs, including FactorNet, scFAN, DeepATAC, MMGraph and MMGraph +jac, were selected as comparison tools.We tested all the models on 200 human ATAC-seq datasets and 80 mouse ATACseq datasets.We utilized the precision, recall, F1_score, specificity, ACC, MCC, AUC, and PRC to evaluate the models' ability in TFBS prediction.In light of our results, GNNMF achieved the highest precision, recall, F1_score, ACC, specificity, MCC, AUC, and PRC of 0.863, 0.846, 0.843, 0.845, 0.944, 0.709, 0.942, and 0.949, respectively, on 200 human datasets.On 88 mouse datasets, our proposed model obtained the highest precision, recall, F1_ score, ACC, specificity, MCC, AUC, and PRC values of 0.844, 0.831, 0.828, 0.830, 0.917, 0.676, 0.931, and 0.946, respectively.Moreover, the GNN-based model achieved higher AEMR scores than models based on CNN.Our results demonstrated that the GNN has more potential for TFBS prediction on ATAC-seq datasets.Compared to MMGraph+jac, GNNMF utilizes the iterloss to predict TFBSs.The AEMR of GNNMF was higher than that of MMGraph+jac, which indicated the iterloss has an advantage over BCEloss in TFBS prediction.
GNNMF utilized the embedding of k-mer nodes and sequence nodes, and calculated coexisting probability between k-mer nodes to find multiple motifs with different lengths.MMGraph and MMGraph+jac used the same way to set the background probability threshold, and they set the background probability threshold to 0.5.However, GNNMF defines the background probability threshold via negative sequences.Our results indicated that GNNMF found more and higher quality motifs than MMGraph and MMGraph+jac.Therefore, the background probability threshold of the negative sequences  aided GNNMF in finding ATAC-seq motifs.FactorNet, scFAN, and DeepATAC are all based on CNNs, they found ATAC-seq motifs by convolutional kernels in the first layer.But these models only found the fixed length of motifs, and the found motifs are similar.However, compared with models based on CNNs, models based on GNNs have great advantages.We utilized all the models to find ATAC-seq motifs and used p-value to evaluate the motifs' quality.The results demonstrated that GNNMF is the best model for finding multiple ATAC-seq motifs.GNNMF employs a three-layer GNN to predict whether given sequences are bound by TFs.Some novel algorithms can be applied in TFBS prediction, such as graph attention networks [36], graphGAN [37], and graph autoencoders [38].In the heterogeneous graph, there are many relationships between nodes and each kind of relationship represents the importance between two nodes.The attention mechanism can allocate and update the different weights to the nodes and edges of the heterogeneous graph, during the training process of the model.Moreover, TFs bind indirectly to motifs of other TFs, which co-regulate targeted gene expression [39].The cooperation of TFs acts as a vital role in the process of human disease [40].ATAC-seq data can detect open-accessible DNA regions by probing open chromatin, meaning that ATAC-seq data contain multiple TFs [41].By analyzing ATAC-seq data, we revealed interactions between TFs, and explored the inducement of human disease.Therefore, GNNMF is a potential tool for studying the cooperation among different TFs.

Conclusions
In this study, we developed a novel model named GNNMF for finding multiple ATAC-seq motifs.GNNMF built the multi-view heterogeneous graph by using ATAC-seq sequences, employed a three-layer of GNN to predict TFBSs, and utilized coexisting probability to find ATACseq motifs.We conducted experiments on 200 human and 80 mouse ATAC-seq datasets to analyze the effectiveness of the proposed method.Our evaluation metrics included precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR.In particular, GNNMF improved the AEMR by 4.92% and 6.81% on 200 human and 80 mouse ATAC-seq datasets, respectively.Meanwhile, GNNMF found multiple motifs from ATAC-seq data through the coexisting probability between k-mers.Regarding ATACseq data, our proposed method found more and higher quality motifs, which demonstrated methods based on coexisting probability of k-mers are more efficient than DL models.As a result, GNNMF achieved better performance than a few state-of-the-art methods.This study made great contributions to finding motifs from ATAC-seq data.

Fig. 4 p
Fig. 4 p-values of the found motifs of six models on 200 human ATAC-seq datasets

Table 1
Count of motifs that each model found on 200 human ATAC-seq datasets